Using automatic stress extraction from audio for improved prosody modelling in speech synthesis

Authors

György Szaszák

András Beke

Gábor Olaszy

Bálint Tóth

Abstract

Generating proper, natural-sounding prosody is one of the key interests of today’s speech synthesis research. An important factor in this effort is the availability of a precisely labelled speech corpus with adequate prosodic stress marking. Obtaining such labelling requires a huge effort, and inter-annotator agreement scores are usually far below 100%. Stress marking based on phonetic transcription is an alternative, but it yields even poorer quality than human annotation. Automatic labelling may help overcome these difficulties. This paper presents an automatic approach to stress detection based purely on audio, which is used to derive an automatic, layered labelling of stress events and to link them to syllables. As a proof of concept, a speech corpus was extended with the output of the stress detection algorithm, and an HMM-TTS system was trained on the extended corpus. Results are compared to a baseline system trained on the same database, but with stress marking obtained from textual transcripts by applying a set of linguistic rules. The evaluation includes CMOS tests and an analysis of the decision trees. Results show an overall improvement in the prosodic properties of the synthesized speech, and subjective ratings indicate that the voice is perceived as more natural.
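As a rough illustration of how a CMOS (comparative mean opinion score) result of the kind mentioned above might be aggregated, the sketch below averages listener preference ratings and attaches an approximate 95% confidence interval. The seven-point scale from -3 to +3, the normal-approximation interval, the function name `cmos`, and the example ratings are all assumptions made for illustration; they are not taken from the paper.

```python
from math import sqrt
from statistics import mean, stdev


def cmos(ratings):
    """Return (mean rating, approximate 95% CI half-width).

    Each rating is one listener judgement comparing the proposed system
    with the baseline on an assumed scale from -3 (much worse) to
    +3 (much better). The interval uses a normal approximation.
    """
    m = mean(ratings)
    half_width = 1.96 * stdev(ratings) / sqrt(len(ratings))
    return m, half_width


# Hypothetical ratings, invented purely to exercise the function.
ratings = [1, 2, 0, 1, -1, 2, 1, 0]
score, ci = cmos(ratings)
print(f"CMOS = {score:+.2f} ± {ci:.2f}")
```

A positive mean whose interval excludes zero would suggest that listeners preferred the proposed system over the baseline.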


Related resources

Automatic Building of Synthetic Voices from Audio Books

Current state-of-the-art text-to-speech systems produce intelligible speech but lack the prosody of natural utterances. Building better models of prosody involves development of prosodically rich speech databases. However, development of such speech databases requires a large amount of effort and time. An alternative is to exploit story style monologues (long speech files) in audio books. These...


Automatic Parameters Estimation of the D. Klatt Phoneme Duration Model

Phoneme duration modelling is one of the stages in prosody modelling for text-to-speech systems. The rule-based phoneme duration model proposed by Klatt (1979) is still quite a popular method. One of the main shortcomings of this method is that the values of the parameters are selected in an experimental way. This work proposes a new iterative algorithm for the automatic estimation of the factors...


Intonation Modelling for Speech Synthesis and Emphasis Preservation

Speech-to-speech translation is a framework which recognises speech in an input language, translates it to a target language and synthesises speech in this target language. In such a system, variations in the speech signal which are inherent to natural human speech are lost, as the information goes through the different building blocks of the translation process. The work presented in this thes...


A very low bit rate speech coder based on a recognition/synthesis paradigm

Recent studies have shown that a concatenative speech synthesis system with a large database produces more natural sounding speech. We apply this paradigm to the design of improved very low bit rate speech coders (sub 1000 b/s). The proposed speech coder consists of unit selection, prosody coding, prosody modification and waveform concatenation. The encoder selects the best unit sequence from a...


Simulating Intonation in Regional Varieties of Swedish

Within the research project SIMULEKT (Simulating Intonational Varieties of Swedish), our recent work includes two approaches to simulating intonation in regional varieties of Swedish. The first involves a method for modelling intonation using the SWING (SWedish INtonation Generator) tool, where annotated speech samples are resynthesised with rule-based intonation and audio-visually analysed wit...



Publication date: 2015
